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Abstract 

NUMA is becoming more widespread in the 
marketplace, used on many systems, small or 
large, particularly with the advent of AMD 
Opteron systems. This paper will cover a sum- 
mary of the current state of NUMA, and future 
developments, encompassing the VM subsys- 
tem, scheduler, topology (CPU, memory, I/O 
layouts including complex non-uniform lay- 
outs), userspace interface APIs, and network 
and disk I/O locality. It will take a broad-based 
approach, focusing on the challenges of creat- 
ing subsystems that work for all machines (in- 
cluding AMD64, PPC64, IA-32, IA-64, etc.), 
rather than just one architecture. 

1 What is a NUMA machine? 

NUMA stands for non-uniform memory archi- 
tecture. Typically this means that not all mem- 
ory is the same “distance” from each CPU in 
the system, but also applies to other features 
such as I/O buses. The word “distance” in this 
context is generally used to refer to both la- 
tency and bandwidth. Typically, NUMA ma- 
chines can access any resource in the system, 
just at different speeds. 


NUMA systems are sometimes measured with 
a simple “NUMA factor” ratio of N:1 — 
meaning that the latency for a cache miss mem- 
ory read from remote memory is N times the la- 
tency for that from local memory (for NUMA 
machines, N > 1). Whilst such a simple de- 
scriptor is attractive, it can also be highly mis- 
leading, as it describes latency only, not band- 
width, on an uncontended bus (which is not 
particularly relevant or interesting), and takes 
no account of inter-node caches. 

The term node is normally used to describe a 
grouping of resources — e.g., CPUs, memory, 
and I/O. On some systems, a node may con- 
tain only some types of resources (e.g., only 
memory, or only CPUs, or only I/O); on oth- 
ers it may contain all of them. The intercon- 
nect between nodes may take many different 
forms, but can be expected to be higher latency 
than the connection within a node, and typi- 
cally lower bandwidth. 

Programming for NUMA machines generally 
implies focusing on locality — the use of re- 
sources close to the device in question, and 
trying to reduce traffic between nodes; this 
type of programming generally results in bet- 
ter application throughput. On some machines 
with high-speed cross-node interconnects, bet- 
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ter performance may be derived under certain 
workloads by “striping” accesses across mul- 
tiple nodes, rather than just using local re- 
sources, in order to increase bandwidth. Whilst 
it is easy to demonstrate a benchmark that 
shows improvement via this method, it is dif- 
ficult to be sure that the concept is generally 
bene heal (i.e., with the machine under full 
load). 

2 Why use a NUMA architecture to 
build a machine? 

The intuitive approach to building a large ma- 
chine, with many processors and banks of 
memory, would be simply to scale up the typ- 
ical 2-4 processor machine with all resources 
attached to a shared system bus. However, re- 
strictions of electronics and physics dictate that 
accesses slow as the length of the bus grows, 
and the bus is shared amongst more devices. 

Rather than accept this global slowdown for a 
larger machine, designers have chosen to in- 
stead give fast access to a limited set of local 
resources, and reserve the slower access times 
for remote resources. 

Historically, NUMA architectures have only 
been used for larger machines (more than 4 
CPUs), but the advantages of NUMA have 
been brought into the commodity marketplace 
with the advent of AMD’s x86-64, which has 
one CPU per node, and local memory for each 
processor. Linux supports NUMA machines 
of every size from 2 CPUs upwards (e.g., SGI 
have machines with 512 processors). 

It might help to envision the machine as a 
group of standard SMP machines, connected 
by a very fast interconnect somewhat like a net- 
work connection, except that the transfers over 
that bus are transparent to the operating sys- 
tem. Indeed, some earlier systems were built 


exactly like that; the older Sequent NUMA- 
Q hardware uses a standard 450NX 4 proces- 
sor chipset, with an SCI interconnect plugged 
into the system bus of each node to unify them, 
and pass traffic between them. The complex 
part of the implementation is to ensure cache- 
coherency across the interconnect, and such 
machines are often referred to as CC-NUMA 
(cache coherent NUMA). As accesses over the 
interconnect are transparent, it is possible to 
program such machines as if they were stan- 
dard SMP machines (though the performance 
will be poor). Indeed, this is exactly how the 
NUMA-Q machines were first bootstrapped. 

Often, we are asked why people do not use 
clusters of smaller machines, instead of a large 
NUMA machine, as clusters are cheaper, sim- 
pler, and have a better price:performance ra- 
tio. Unfortunately, it makes the programming 
of applications much harder; all of the inter- 
communication and load balancing now has to 
be more explicit. Some large applications (e.g., 
database servers) do not split up across mul- 
tiple cluster nodes easily — in those situations, 
people often use NUMA machines. In addi- 
tion, the interconnect for NUMA boxes is nor- 
mally very low latency, and very high band- 
width, yielding excellent performance. The 
management of a single NUMA machine is 
also simpler than that of a whole cluster with 
multiple copies of the OS. 

We could either have the operating system 
make decisions about how to deal with the ar- 
chitecture of the machine on behalf of the user 
processes, or give the userspace application an 
API to specify how such decisions are to be 
made. It might seem, at first, that the userspace 
application is in a better position to make such 
decisions, but this has two major disadvan- 
tages: 

1 . Every application must be changed to sup- 
port NUMA machines, and may need to 



be revised when a new hardware platform 
is released. 

2. Applications are not in a good position 
to make global holistic decisions about 
machine resources, coordinate themselves 
with other applications, and balance deci- 
sions between them. 

Thus decisions on process, memory and I/O 
placement are normally best left to the oper- 
ating system, perhaps with some hints from 
userspace about which applications group to- 
gether, or will use particular resources heavily. 
Details of hardware layout are put in one place, 
in the operating system, and tuning and modi- 
fication of the necessary algorithms are done 
once in that central location, instead of in ev- 
ery application. In some circumstances, the 
application or system administrator will want 
to override these decisions with explicit APIs, 
but this should be the exception, rather than the 
norm. 

3 Linux NUMA Memory Support 

In order to manage memory, Linux requires 
a page descriptor structure (struct page) 
for each physical page of memory present in 
the system. This consumes approximately 1% 
of the memory managed (assuming 4K page 
size), and the structures are grouped into an ar- 
ray called mem_map. For NUMA machines, 
there is a separate array for each node, called 
lmem_map. The mem_map and lmem_map 
arrays are simple contiguous data structures ac- 
cessed in a linear fashion by their offset from 
the beginning of the node. This means that the 
memory controlled by them is assumed to be 
physically contiguous. 

NUMA memory support is enabled by 
CONF I G_D I S CONT I GMEM and CONFIG_ 
NUMA. A node descriptor called a struct 
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pgdata_t is created for each node. Cur- 
rently we do not support discontiguous mem- 
ory within a node (though large gaps in the 
physical address space are acceptable between 
nodes). Thus we must still create page descrip- 
tor structures for “holes” in memory within a 
node (and then mark them invalid), which will 
waste memory (potentially a problem for large 
holes). 

Dave McCracken has picked up Daniel 
Phillips’ earlier work on a better data struc- 
ture for holding the page descriptors, called 
CONF I G_NONL I NEAR. This will allow the 
mapping of discontigous memory ranges in- 
side each node, and greatly simplify the ex- 
isting code for discontiguous memory on non- 
NUMA machines. 

CONFIG_NONLINEAR solves the problem by 
creating an artificial layer of linear addresses. 
It does this by dividing the physical address 
space into fixed size sections (akin to very 
large pages), then allocating an array to allow 
translations from linear physical address to true 
physical address. This added level of indirec- 
tion allows memory with widely differing true 
physical addresses to appear adjacent to the 
page allocator and to be in the same zone, with 
a single struct page array to describe them. It 
also provides support for memory hotplug by 
allowing new physical memory to be added to 
an existing zone and struct page array. 

Linux normally allocates memory for a process 
on the local node, i.e., the node that the pro- 
cess is currently running on. alloc_pages 
will call alloc_pages_node for the cur- 
rent processor’s node, which will pass the rele- 
vant zonelist (pgdat->node_zonelists) 

to the core allocator ( al loc_pages). The 

zonelists are built by build_zonelist s, 
and are set up to allocate memory in a round- 
robin fashion, starting from the local node (this 
creates a roughly even distribution of memory 
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pressure). 

In the interest of reducing cross-node traffic, 
and reducing memory access latency for fre- 
quently accessed data and text, it is desirable 
to replicate any such memory that is read-only 
to each node, and use the local copy on any ac- 
cesses, rather than a remote copy. The obvious 
candidates for such replication are the kernel 
text itself, and the text of shared libraries such 
as libc. Of course, this faster access comes 
at the price of increased memory usage, but 
this is rarely a problem on large NUMA ma- 
chines. Whilst it might be technically possible 
to replicate read/write mappings, this is com- 
plex, of dubious utility, and is unlikely to be 
implemented. 

Kernel text is assumed by the kernel itself to 
appear at a fixed virtual address, and to change 
this would be problematic. Hence the easiest 
way to replicate it is to change the virtual to 
physical mappings for each node to point at a 
different address. On IA-64, this is easy, since 
the CPU provides hardware assistance in the 
form of a pinned TLB entry. 

On other architectures this proves more diffi- 
cult, and would depend on the structure of the 
pagetables. On IA-32 with PAE enabled, as 
long as the user-kernel split is aligned on a 
PMD boundary, we can have a separate ker- 
nel PMD for each node, and point the vmalloc 
area (which uses small page mappings) back to 
a globally shared set of PTE pages. The PMD 
entries for the ZONE_NORMAL areas normally 
never change, so this is not an issue, though 
there is an issue with ioremap_nocache 
that can change them (GART trips over this) 
and speculative execution means that we will 
have to deal with that (this can be a slow-path 
that updates all copies of the PMDs though). 

Dave Hansen has created a patch to replicate 
read only pagecache data, by adding aper-node 
data structure to each node of the pagecache 


radix tree. As soon as any mapping is opened 
for write, the replication is collapsed, making 
it safe. The patch gives a 5%-40% increase in 
performance, depending on the workload. 

In the 2.6 Linux kernel, we have a per-node 
LRU for page management and a per-node 
LRU lock, in place of the global structures 
and locks of 2.4. Not only does this reduce 
contention through finer grained locking, it 
also means we do not have to search other 
nodes’ page lists to free up pages on one node 
which is under memory pressure. Moreover, 
we get much better locality, as only the lo- 
cal kswapd process is accessing that node’s 
pages. Before splitting the LRU into per-node 
lists, we were spending 50% of the system time 
during a kernel compile just spinning wait- 
ing for pagemap_lru_lock (which was the 
biggest global VM lock at the time). Con- 
tention for the pagemap_lru_lock is now 
so small it is not measurable. 

4 Sched Domains — a Topology- 
aware Scheduler 

The previous Linux scheduler, the 0(1) sched- 
uler, provided some needed improvements to 
the 2.4 scheduler, but shows its age as more 
complex system topologies become more and 
more common. With technologies such as 
NUMA, Symmetric Multi-Threading (SMT), 
and variations and combinations of these, the 
need for a more flexible mechanism to model 
system topology is evident. 

4.1 Overview 

In answer to this concern, the mainline 2.6 
tree (linux-2.6.7-rcl at the time of this writing) 
contains an updated scheduler with support for 
generic CPU topologies with a data structure, 
struct sched_domain, that models the 
architecture and defines scheduling policies. 
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Simply speaking, sched domains group CPUs 
together in a hierarchy that mimics that of the 
physical hardware. Since CPUs at the bot- 
tom of the hierarchy are most closely related 
(in terms of memory access), the new sched- 
uler performs load balancing most often at the 
lower domains, with decreasing frequency at 
each higher level. 

Consider the case of a machine with two SMT 
CPUs. Each CPU contains a pair of virtual 
CPU siblings which share a cache and the core 
processor. The machine itself has two physi- 
cal CPUs which share main memory. In such 
a situation, treating each of the four effective 
CPUs the same would not result in the best 
possible performance. With only two tasks, 
for example, the scheduler should place one 
on CPUO and one on CPU2, and not on the 
two virtual CPUs of the same physical CPU. 
When running several tasks it seems natural to 
try to place newly ready tasks on the CPU they 
last ran on (hoping to take advantage of cache 
warmth). However, virtual CPU siblings share 
a cache; a task that was running on CPUO, 
then blocked, and became ready when CPUO 
was running another task and CPU 1 was idle, 
would ideally be placed on CPU1. Sched do- 
mains provide the structures needed to realize 
these sorts of policies. With sched domains, 
each physical CPU represents a domain con- 
taining the pair of virtual siblings, each repre- 
sented in a sched_group structure. These 
two domains both point to a parent domain 
which contains all four effective processors in 
two sched_group structures, each contain- 
ing a pair of virtual siblings. Figure 1 illus- 
trates this hierarchy. 

Next consider a two-node NUMA machine 
with two processors per node. In this example 
there are no virtual sibling CPUs, and there- 
fore no shared caches. When a task becomes 
ready and the processor it last ran on is busy, 
the scheduler needs to consider waiting un- 



Figure 1: SMT Domains 


til that CPU is available to take advantage of 
cache warmth. If the only available CPU is 
on another node, the scheduler must carefully 
weigh the costs of migrating that task to an- 
other node, where access to its memory will 
be slower. The lowest level sched domains in 
a machine like this will contain the two pro- 
cessors of each node. These two CPU level 
domains each point to a parent domain which 
contains the two nodes. Figure 2 illustrates this 
hierarchy. 



Figure 2: NUMA Domains 


The next logical step is to consider an SMT 
NUMA machine. By combining the previous 
two examples, the resulting sched domain hier- 
archy has three levels, sibling domains, physi- 
cal CPU domains, and the node domain. Fig- 
ure 3 illustrates this hierarchy. 

The unique AMD Opteron architecture war- 
rants mentioning here as it creates a NUMA 
system on a single physical board. In this case, 
however, each NUMA node contains only one 
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Figure 3: SMT NUMA Domains 


physical CPU. Without careful consideration 
of this property, a typical NUMA sched do- 
mains hierarchy would perform badly, trying 
to load balance single CPU nodes often (an ob- 
vious waste of cycles) and between node do- 
mains only rarely (also bad since these actually 
represent the physical CPUs). 

4.2 Sched Domains Implementation 
4.2.1 Structure 

The sched_domain structure stores pol- 
icy parameters and flags and, along with 
the sched_group structure, is the primary 
building block in the domain hierarchy. Fig- 
ure 4 describes these structures. The sched_ 
domain structure is constructed into an up- 
wardly traversable tree via the parent pointer, 
the top level domain setting parent to NULL. 
The groups list is a circular list of of sched_ 
group structures which essentially define the 
CPUs in each child domain and the relative 
power of that group of CPUs (two physical 
CPUs are more powerful than one SMT CPU). 
The span member is simply a bit vector with a 
1 for every CPU encompassed by that domain 
and is always the union of the bit vector stored 


in each element of the groups list. The remain- 
ing fields define the scheduling policy to be fol- 
lowed while dealing with that domain, see Sec- 
tion 4.2.2. 

While the hierarchy may seem simple, the de- 
tails of its construction and resulting tree struc- 
tures are not. For performance reasons, the 
domain hierarchy is built on a per-CPU basis, 
meaning each CPU has a unique instance of 
each domain in the path from the base domain 
to the highest level domain. These duplicate 
structures do share the sched_group struc- 
tures however. The resulting tree is difficult to 
diagram, but resembles Figure 5 for the ma- 
chine with two SMT CPUs discussed earlier. 

In accordance with common practice, each 
architecture may specify the construction of 
the sched domains hierarchy and the pa- 
rameters and flags defining the various poli- 
cies. At the time of this writing, only i386 
and ppc64 defined custom construction rou- 
tines. Both architectures provide for SMT 
processors and NUMA configurations. With- 
out an architecture-specific routine, the kernel 
uses the default implementations in sched . c, 
which do take NUMA into account. 
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struct sched_domain { 

/* These fields must be setup */ 
struct sched_domain ^parent; 
struct sched_group ^groups; 
cpumask_t span; 
unsigned long min_interval; 
unsigned long max_interval; 
unsigned int busy_factor; 
unsigned int imbalance_pct ; 
unsigned long long cache_hot_time; 
unsigned int cache_nice_tries; 
unsigned int per_cpu_gain; 
int flags; 

/* Runtime fields. */ 
unsigned long last_balance; 
unsigned int balance_interval; 
unsigned int nr_balance_f ailed; 

}; 

struct sched_group { 

struct sched_group *next; 
cpumask_t cpumask; 
unsigned long cpu_power; 


/* top domain must be null terminated */ 
/* the balancing groups of the domain */ 
/* span of all CPUs in this domain */ 

/* Minimum balance interval ms */ 

/* Maximum balance interval ms */ 

/* less balancing by factor if busy */ 

/* No balance until over watermark */ 

/* Task considered cache hot (ns) */ 

/* Leave cache hot tasks for # tries */ 
/* CPU % gained by adding domain cpus */ 
/* See SD_* */ 

/* init to jiffies, units in jiffies */ 
/* initialise to 1. units in ms. */ 

/* initialise to 0 */ 


/* Must be a circular list */ 


Figure 4: Sched Domains Structures 


4.2.2 Policy 

The new scheduler attempts to keep the sys- 
tem load as balanced as possible by running re- 
balance code when tasks change state or make 
specific system calls, we will call this event 
balancing, and at specified intervals measured 
in jiffies, called active balancing. Tasks must 
do something for event balancing to take place, 
while active balancing occurs independent of 
any task. 

Event balance policy is defined in each 
sched_domain structure by setting a com- 
bination of the #defines of figure 6 in the flags 
member. 

To define the policy outlined for the dual SMT 
processor machine in Section 4.1, the low- 
est level domains would set SD_BALANCE_ 
NEWIDLE and SD_WAKE_IDLE (as there is 
no cache penalty for running on a differ- 
ent sibling within the same physical CPU), 
SD_SHARE_CPUPOWER to indicate to the 
scheduler that this is an SMT processor (the 


scheduler will give full physical CPU ac- 
cess to a high priority task by idling the 
virtual sibling CPU), and a few common 
flags SD_BALANCE_EXEC, SD_BALANCE_ 
CLONE, and SD_WAKE_AFF INE. The next 
level domain represents the physical CPUs 
and will not set SD_WAKE_IDLE since cache 
warmth is a concern when balancing across 
physical CPUs, nor SD_SHARE_CPUPOWER. 
This domain adds the SD_WAKE_BALANCE 
flag to compensate for the removal of SD_ 
WAKE_IDLE. As discussed earlier, an SMT 
NUMA system will have these two domains 
and another node-level domain. This do- 
main removes the SD_BALANCE_NEWIDLE 
and SD_WAKE_AFFINE flags, resulting in 
far fewer balancing across nodes than within 
nodes. When one of these events occurs, the 
scheduler search up the domain hierarchy and 
performs the load balancing at the highest level 
domain with the corresponding flag set. 

Active balancing is fairly straightforward and 
aids in preventing CPU-hungry tasks from hog- 
ging a processor, since these tasks may only 
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#def ine 

SD_ 

_B AL AN CE_NE W I D L E 

i 

/* 

Balance 

when about to become idle */ 

#def ine 

SD_ 

_BALANCE_EXEC 

2 

/* 

Balance 

on exec */ 


#def ine 

SD_ 

_balance_clone 

4 

/* 

Balance 

on clone */ 


#def ine 

SD_ 

_WAKE_IDLE 

8 

/* 

Wake to 

idle CPU on 

task wakeup */ 

#def ine 

SD_ 

_WAKE_AFF INE 

16 

/* 

Wake task to waking 

CPU */ 

#def ine 

SD_ 

_WAKE_BALANCE 

32 

/* 

Perform 

balancing at 

task wakeup */ 

#def ine 

SD_ 

_SHARE_CPUPOWER 

64 

/* 

Domain members share 

cpu power */ 


Figure 6: Sched Domains Policies 
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Kernbench Results 
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Figure 7 : Kernbench Results 



Figure 5: Per CPU Domains 


rarely trigger event balancing. At each re- 
balance tick, the scheduler starts at the low- 
est level domain and works its way up, check- 
ing the balance_interval and last_ 
balance fields to determine if that domain 
should be balanced. If the domain is already 
busy, the balance_interval is adjusted 
using the busy_factor field. Other fields 
define how out of balance a node must be be- 
fore rebalancing can occur, as well as some 
sane limits on cache hot time and min and max 
balancing intervals. As with the flags for event 
balancing, the active balancing parameters are 
defined to perform less balancing at higher do- 
mains in the hierarchy. 


To compare the 0(1) scheduler of mainline 
with the sched domains implementation in the 
mm tree, we ran kernbench (with the - j option 
to make set to 8, 16, and 32) on a 16 CPU SMT 
machine (32 virtual CPUs) on linux-2.6.6 and 
linux-2.6.6-mm3 (the latest tree with sched do- 
mains at the time of the benchmark) with and 
without CONFIG_SCHED_SMT enabled. The 
results are displayed in Figure 7. The 0(1) 
scheduler evenly distributed compile tasks ac- 
cross virtual CPUs, forcing tasks to share cache 
and computational units between virtual sib- 
ling CPUs. The sched domains implementa- 
tion with CONFIG_SCHED_SMT enabled bal- 
anced the load accross physical CPUs, making 
far better use of CPU resources when running 
fewer tasks than CPUs (as in the j8 case) since 
each compile task would have exclusive access 
to the physical CPU. Surprisingly, sched do- 
mains (which would seem to have more over- 
head than the mainline scheduler) even showed 
improvement for the j32 case, where it doesn’t 





benefit from balancing across physical CPUs 
before virtual CPUs as there are more tasks 
than virtual CPUs. Considering the sched do- 
mains implementation has not been heavily 
tested or tweaked for performance, some fine 
tuning is sure to further improve performance. 

The sched domains structures replace the ex- 
panding set of #ifdefs of the 0(1) sched- 
uler, which should improve readability and 
maintainability. Unfortunately, the per CPU 
nature of the domain construction results in a 
non-intuitive structure that is difficult to work 
with. For example, it is natural to discuss the 
policy defined at “the” top level domain; un- 
fortunately there are NR_CPUS top level do- 
mains and, since they are self-adjusting, each 
one could conceivably have a different set of 
flags and parameters. Depending on which 
CPU the scheduler was running on, it could be- 
have radically differently. As an extension of 
this research, an effort to analyze the impact of 
a unified sched domains hierarchy is needed, 
one which only creates one instance of each 
domain. 

Sched domains provides a needed structural 
change to the way the Linux scheduler views 
modern architectures, and provides the pa- 
rameters needed to create complex scheduling 
policies that cater to the strengths and weak- 
nesses of these systems. Currently only 1386 
and ppc64 machines benefit from arch specific 
construction routines; others must now step 
forward and fill in the construction and param- 
eter setting routines for their architecture of 
choice. There is still plenty of fine tuning and 
performance tweaking to be done. 
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5 NUMA API 

5.1 Introduction 

One of the biggest impediments to the ac- 
ceptance of a NUMA API for Linux was a 
lack of understanding of what its potential uses 
and users would be. There are two schools 
of thought when it comes to writing NUMA 
code. One says that the OS should take care 
of all the NUMA details, hide the NUMA- 
ness of the underlying hardware in the ker- 
nel and allow userspace applications to pre- 
tend that it’s a regular SMP machine. Linux 
does this by having a process scheduler and 
a VMM that make intelligent decisions based 
on the hardware topology presented by arch- 
specific code. The other way to handle NUMA 
programming is to provide as much detail as 
possible about the system to userspace and 
allow applications to exploit the hardware to 
the fullest by giving scheduling hints, mem- 
ory placement directives, etc., and the NUMA 
API for Linux handles this. Many applications, 
particularly larger applications with many con- 
current threads of execution, cannot fully uti- 
lize a NUMA machine with the default sched- 
uler and VM behavior. Take, for example, a 
database application that uses a large region of 
shared memory and many threads. This appli- 
cation may have a startup thread that initiali z es 
the environment, sets up the shared memory 
region, and forks off the worker threads. The 
default behavior of Linux’s VM for NUMA is 
to bring pages into memory on the node that 
faulted them in. This behavior for our hy- 
pothetical app would mean that many pages 
would get faulted in by the startup thread on 
the node it is executing on, not necessarily on 
the node containing the processes that will ac- 
tually use these pages. Also, the forked worker 
threads would get spread around by the sched- 
uler to be balanced across all the nodes and 
their CPUs, but with no guarantees as to which 
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threads would be associated with which nodes. 
The NUMA API and scheduler affinity syscalls 
allow this application to specify that its threads 
be pinned to particular CPUs and that its mem- 
ory be placed on particular nodes. The appli- 
cation knows which threads will be working 
with which regions of memory, and is better 
equipped than the kernel to make those deci- 
sions. 

The Linux NUMA API allows applications 
to give regions of their own virtual memory 
space specific allocation behaviors, called poli- 
cies. Currently there are four supported poli- 
cies: PREFERRED, BIND, INTERLEAVE, 
and DEFAULT. The DEFAULT policy is the 
simplest, and tells the VMM to do what it 
would normally do (ie: pre-NUMA API) for 
pages in the policied region, and fault them 
in from the local node. This policy applies 
to all regions, but is overridden if an appli- 
cation requests a different policy. The PRE- 
FERRED policy allows an application to spec- 
ify one node that all pages in the policied re- 
gion should come from. However, if the spec- 
ified node has no available pages, the PRE- 
FERRED policy allows allocation to fall back 
to any other node in the system. The BIND 
policy allows applications to pass in a node- 
mask, a bitmap of nodes, that the VM is re- 
quired to use when faulting in pages from a re- 
gion. The fourth policy type, INTERLEAVE, 
again requires applications to pass in a node- 
mask, but with the INTERLEAVE policy, the 
nodemask is used to ensure pages are faulted 
in in a round-robin fashion from the nodes 
in the nodemask. As with the PREFERRED 
policy, the INTERLEAVE policy allows page 
allocation to fall back to other nodes if nec- 
essary. In addition to allowing a process to 
policy a specific region of its VM space, the 
NUMA API also allows a process to policy 
its entire VM space with a process-wide pol- 
icy, which is set with a different syscall: set_ 
mempolicy ( ) . Note that process-wide poli- 


cies are not persistent over swapping, however 
per-VMA policies are. Please also note that 
none of the policies will migrate existing (al- 
ready allocated) pages to match the binding. 

The actual implementation of the in-kernel 
policies uses a struct mempolicy that is 
hung off the struct vm_area_struct. 
This choice involves some tradeoffs. The first 
is that, previous to the NUMA API, the per- 
VMA structure was exactly 32 bytes on 32- 
bit architectures, meaning that multiple vm__ 
area_structs would fit conveniently in a 
single cacheline. The structure is now a lit- 
tle larger, but this allowed us to achieve a per- 
VMA granularity to policied regions. This is 
important in that it is flexible enough to bind 
a single page, a whole library, or a whole pro- 
cess’ memory. This choice did lead to a sec- 
ond obstacle, however, which was for shared 
memory regions. For shared memory regions, 
we really want the policy to be shared amongst 
all processes sharing the memory, but VMAs 
are not shared across separate tasks. The solu- 
tion that was implemented to work around this 
was to create a red-black tree of “shared pol- 
icy nodes” for shared memory regions. Due 
to this, calls were added to the vm_ops struc- 
ture which allow the kernel to check if a shared 
region has any policies and to easily retrieve 
these shared policies. 

5.2 Syscall Entry Points 

1. sys_mbind(unsigned long start, unsigned 
long len, unsigned long mode, unsigned 
long Unmask, unsigned long maxnode, 
unsigned flags); 

Bind the region of memory [start, 
start + len) according to mode and 
flags on the nodes enumerated in 
nmask and having a maximum possible 
node number of maxnode. 

2. sys_set_mempolicy(int mode, unsigned 



long *nmask, unsigned long maxnode); 

Bind the entire address space of the cur- 
rent process according to mode on the 
nodes enumerated in nmask and hav- 
ing a maximum possible node number of 

maxnode. 

3. sys_get_mempolicy(int *policy, unsigned 
long *nmask, unsigned long maxnode, 
unsigned long addr, unsigned long flags); 

Return the current binding’s mode in 
policy and node enumeration in 
nmask based on the maxnode, addr, 
and flags passed in. 

In addition to the raw syscalls discussed above, 
there is a user-level library called “libnuma” 
that attempts to present a more cohesive inter- 
face to the NUMA API, topology, and sched- 
uler affinity functionality. This, however, is 
documented elsewhere. 

5.3 At inbind ( ) Time 

After argument validation, the passed-in list of 
nodes is checked to make sure they are all on- 
line. If the node list is ok, a new memory policy 
structure is allocated and populated with the 
binding details. Next, the given address range 
is checked to make sure the vma’s for the re- 
gion are present and correct. If the region is ok, 
we proceed to actually install the new policy 
into all the vma’s in that range. For most types 
of virtual memory regions, this involves simply 
pointing the vma->vm_policy to the newly 
allocated memory policy structure. For shared 
memory, hugetlbfs, and tmpfs, however, it’s 
not quite this simple. In the case of a memory 
policy for a shared segment, a red-black tree 
root node is created, if it doesn’t already exist, 
to represent the shared memory segment and 
is populated with “shared policy nodes.” This 
allows a user to bind a single shared memory 
segment with multiple different bindings. 
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5.4 At Page Fault Time 

There are now several new and differ- 
ent flavors of alloc_pages () style func- 
tions. Previous to the NUMA API, there 
existed alloc_page () , alloc_pages () 
and alloc_pages_node () . Without go- 
ing into too much detail, alloc_page() 
and alloc_pages () both called alloc_ 
pages_node ( ) with the current node id as 
an argument. alloc_pages_node ( ) allo- 
cated 2 order pages from a specific node, and 
was the only caller to the real page allocator, 
alloc_pages () . 



Figure 8: old alloc_pages 


With the introduction of the NUMA API, non- 
NUMA kernels still retain the old alloc_ 
page*() routines, but the NUMA alloca- 
tors have changed, al loc_pages_node ( ) 

and alloc_pages () , the core routines 

remain untouched, but all calls to alloc_ 
page ( ) /alloc_pages ( ) now end up call- 
ing alloc_pages_current ( ) , a new 
function. 
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There has also been the addition 
of two new page allocation func- 
tions: al loc__page_vma ( ) and 

alloc_page_interleave ( ) . 
alloc_pages_current () checks that the 
system is not currently in_interrupt () , 
and if it isn’t, uses the current pro- 
cess’s process policy for allocation. If 
the system is currently in interrupt con- 
text, alloc_pages_current ( ) falls 
back to the old default allocation scheme. 
alloc_page_interleave () allocates 
pages from regions that are bound with an 
interleave policy, and is broken out separately 
because there are some statistics kept for 
interleaved regions. alloc_page_vma ( ) 
is a new allocator that allocates only sin- 
gle pages based on a per-vma policy. The 
alloc_page_vma ( ) function is the only 
one of the new allocator functions that must be 
called explicity, so you will notice that some 
calls to al loc_pages ( ) have been replaced 
by calls to alloc_page_vma ( ) throughout 
the kernel, as necessary. 

5.5 Problems/Future Work 


There is no checking that the nodes re- 
quested are online at page fault time, so in- 
teractions with hotpluggable CPUs/memory 
will be tricky. There is an asymmetry be- 
tween how you bind a memory region and 
a whole process’s memory: One call takes 
a flags argument, and one doesn’t. Also 
the maxnode argument is a bit strange, 
the get/set_affinity calls take a number of 
bytes to be read/written instead of a max- 
imum CPU number. The alloc_page_ 
interleave () function could be dropped if 
we were willing to forgo the statistics that are 
kept for interleaved regions. Again, a lack of 
symmetry exists because other types of poli- 
cies aren’t tracked in any way. 


6 Legal statement 
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Figure 9: new alloc_pages 
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